ABSTRACT

There are two major aspects to the cutoff score issue
in criterion-referenced (CR) measurement: whether a cutoff score is
actually needed for a CR test, and the method for setting the cutoff
score if one is used. There are many factors to be considered in
determining a cutoff score. While cutoff scores are very often set
arbitrarily (e.g., 80%), there have been many methods suggested to
improve the quality of judgment or to utilize quantitative approaches
of varying degrees of complexity and precision; these methods are
reviewed in this paper. Although more research is needed to confirm
the value of these suggested methods, several appear promising.
(Author/RC)
One of the issues which distinguishes criterion-referenced
(CR) measurement from norm-referenced (NR) measurement is that
of setting a "cutoff" or "mastery" score denoting the level of
minimum acceptable performance on the segment of instruction
covered by the test. This issue seldom arises in NR measurement,
because NR interpretations ordinarily are made on a
relative rather than an absolute basis. However, the cutoff
score issue is one of the key points of controversy among the
various conceptualizations of CR measurement. There are two
major aspects to the cutoff score issue: whether a cutoff
score is, in fact, actually needed, and the method which
should be used to establish a cutoff score if one is to be used.


There is one position in the literature on CR measurement
which holds that a cutoff score is not considered necessary
or relevant; in this view, CR measurement does not necessarily
imply making absolute judgments. This position is well
expressed by Nitko, who takes the following view:
absolute interpretations can be extremely dangerous
. . . Nothing in the nature of CR testing implies
that anyone necessarily meet a given standard of
competence, only that such levels of competency be
defined in terms of performance (Nitko, 1970, p. 39).

While he does not rule out the usefulness of a cutoff score
in some situations, Nitko contends that the concept of CR
testing does not necessarily require making a value judgment
about whether flawless performance is possible.

Nitko prefers an empirical, decision-oriented approach,
where a cutoff score is not set on an a priori basis but only
on an empirical basis in answer to a question such as, "What
level of performance is required at one point in the instructional
sequence in order to maximize success at the next
point in the sequence?" There is no inherent reason why this
point could not differ among individuals and in different
circumstances.

This point of view as expressed by Nitko is unusual. The
more typical view in the CR literature is that the setting of
a cutoff score is an inherent requirement of CR measurement
which is often reflected even in the definition of a CR test.
Depending on the writer, the requirement for a cutoff score
may be considered as fundamental to the development of the CR
test (e.g., Ivens, 1970; Jackson, 1970) or it may be considered
as essential in making an interpretation of the score (e.g.,
Fremer, 1972).

Usually, the task of a CR test is to place an examinee in
one of two categories, master or nonmaster, with a minimum
number of classification errors. Most of the methods to be
described are concerned with this kind of dichotomous
decision. However, it also would be very reasonable to consider
models which divide examinees into three or more
categories (Harris, 1974).

The establishment of a cutoff (or mastery) score involves
some difficult methodological problems. However, it also involves
resolving some basic conceptual issues as well. Skager
(Note 1) notes that "How to define the nature of any performance
that would indicate mastery of a domain of content remains
a major conceptual problem (p. )." At a practical level, the
fact that item difficulties for CR tests, as for other tests,
can be easily influenced presents a danger of making incorrect
assumptions that any given score on a CR test represents an
accurate judgment as to mastery or nonmastery of an objective
(Klein & Kosecoff, 1973). Stanley and Hopkins (1972) have
demonstrated that, as might be expected, items of widely different
difficulty can be written to fit the same instructional
objective. Therefore criteria such as 90 percent are arbitrary
and nearly meaningless in the absence of a definitive
external reference point. Furthermore, an arbitrary cutoff
level, such as 80 percent, implicitly assumes that all items are of
equal importance (Kifer & Bramble, 1974), a quite unreasonable
assumption.

Setting a minimum performance level prior to instruction
is particularly appropriate when mastery is essential to the
subsequent attainment of other important objectives (Sullivan,
1969). In other words, the setting of a cutoff level for an
objective should properly reflect a determination that the
attainment of the objective has instructional significance.



The Standards for Educational and Psychological Tests (American
Psychological Association, et al., 1974) also require that a
rationale be provided for the selection of a cutoff score used
in test interpretation. To accept this goal, however, does not
determine how it can be achieved. What is desired is to
minimize the number of incorrect classification decisions, but
neither classical (NR) procedures nor item sampling approaches
are very effective in individual decisionmaking (Haladyna, 1975).

Like many issues in CR measurement, the issue of setting
a mastery standard has a long history. Monroe (1917), for
example, discussed the issue at some length and concluded that
a standard must meet two conditions: that it be reasonable and
that it be "efficient." A reasonable standard was defined by
Monroe as one which realistically can be attained by students,
and an "efficient" standard was defined as one which represents
a level of performance which equips students for meeting present
and future demands.






The level at which a cutoff score should be set will vary
depending upon the cruciality of the objective; for very important
objectives, the appropriate cutoff level may be quite high.
Two other important factors should be considered in establishing
performance standards: the difficulty level of the instructional
content (insofar as this can be determined independently of
actual learner performance), and the amount of instructional
time available relative to the instructional material to be
covered.




Performance standards may be established either by
setting a cutoff score which must be attained
by each individual learner, or by specifying a group standard
in terms of the percentage of students in the class or other
target group who will be expected to attain the individual
cutoff score if the instruction is successful. The latter
approach is much less common.

An important point in the setting of the criterion level
was made by Kriewall and Hirsch (1969), who pointed out that
"setting a higher error criterion does not of itself improve
the proficiency found in those examinees classified as
masters (p. 8)." Similarly, poorer-performing individuals do
not become even less proficient by being designated as nonmasters.
That is, the distribution of actual proficiency is
independent of the imposition of proficiency standards
(Gardner, 1962). Simply by moving the standard, the proportion
of masters or nonmasters can be changed without having any
effect upon the distribution of actual performance. Mastery
standards cannot be set independent of the performance of the
individuals involved; the level of performance which may be
required for mastery must be realistic in terms of the prevailing
levels of competence (Garvin, 1971).
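
The observation that moving the standard changes only the labels, not
the underlying performance, can be illustrated with a minimal sketch.
The score list and the helper name `proportion_masters` are
hypothetical and chosen only for illustration; they do not come from
any of the sources cited.

```python
from typing import Sequence


def proportion_masters(scores: Sequence[float], cutoff: float) -> float:
    """Return the fraction of examinees scoring at or above the cutoff."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s >= cutoff) / len(scores)


# Hypothetical percent-correct scores for ten examinees (illustrative only).
scores = [55, 62, 70, 74, 78, 81, 85, 88, 92, 97]

for cutoff in (70, 80, 90):
    # The score list is never modified; only the master/nonmaster
    # labels implied by the cutoff change.
    print(cutoff, proportion_masters(scores, cutoff))
```

Raising the cutoff from 70 to 90 lowers the proportion classified as
masters, while the list of scores itself stays exactly the same.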

There rarely is a clear basis upon which to establish a
cutoff score in educational situations. In the absence of
other evidence, the cutoff score is most commonly set on some
subjective basis relying on informed judgment. Gronlund (1973),
for example, has offered a step-by-step, trial-and-error procedure
in which an initial arbitrary standard is then adjusted
upward or downward on the basis of experience and judgment.
Gronlund proceeds from the suggestion of Block (1971) that,
although 100 percent mastery might seem to be the ideal, 80
to 85 percent correct is a more realistic standard.



Gronlund's procedure is workable, but it is only a
general guide to exercising informed judgment. The procedure
does not unequivocally determine the size of the adjustments
or the level of the final standard. Like Nitko, Gronlund
(1973) concluded that the ultimate question is, "What level
of mastery is necessary in order to learn effectively at the
next stage of instruction (p. 13)?"

Such subjectivity in setting cutoff levels represents a
serious shortcoming of CR measurement to some writers such as
Ebel (1973). Ebel has also pointed out that relative standards,
as in NR measurement, which are derived from the average
performances of groups of examinees are more stable than
absolute standards based on the judgments of individual
instructors.

Millman (1973) is among those critical of routinely using
a single percentage standard in all content domains and for
all individuals. He suggests five approaches which might be
used to standardize and refine the application of judgment in
establishing standards of achievement.

1. Standards established on the basis of the actual past
performance of typical persons, so that some predetermined
percentage of persons will pass. This approach is applicable
when only a fixed number or proportion can be permitted to
"pass," and it probably resembles typical NR practice more
than it does criterion-referencing. Block (1971) and Klein
(1972) likewise suggest transferring existing grading standards
set under non-mastery conditions to the mastery situation.
2. Standards established on the basis of informed judgment,
on an item-by-item basis, as to how important it is that each
item on a test be answered correctly. This would preferably
be done by a panel of informed judges working under standardized
procedures. Nedelsky's (1954) procedure for assigning grades,
described below, may be considered as one variation of this
approach, and Aul^s and Pearson's (Note 2) method of "quantifying
intuitions" is another. Still another variation of this
approach, for the test as a whole, has been suggested by

kri^M'" <19?3) • 

3. Standards established on the basis of educational
consequences in terms of future learning. Higher cutoff scores
may be required for fundamental or prerequisite learnings.
Weighted regression equations or expectancy tables may be useful
in this approach.

4. Standards established on the basis of psychological
or financial costs, so far as these can be determined. A high
cutoff may be justified when the cost of false advances is high.

5. Standards established with allowance made for measurement
error due to pure guessing, or for the effects of
nonrepresentativeness of the item sample. This latter approach is
actually a suggestion for final refinement of whatever cutoff
score may be derived by one of the other approaches.

While any of the approaches discussed by Millman would
help to improve the process of setting cutoff scores, it must
be recognized that most of them are relatively crude and
dependent upon unsubstantiated judgment and knowledge of actual
performance. Other more systematic approaches to the problem
have been proposed for the process of
setting a cutoff level. Some of these more systematic
approaches are still entirely judgmental, and others are
statistical in nature.

Nedelsky (1954) addressed the problem of determining the
cutoff level indirectly, through the assignment of grades. In
Nedelsky's approach, the "Minimum Passing Score" from which
other grades are scaled is derived solely on the basis of
pooled judgment without reference to the distribution of actual
obtained scores. The system is exact and unambiguous in its
application, but it is not intuitively appealing. Nedelsky's
procedure has not come into significant use.

Another more systematic, but still somewhat subjective,
view of establishing a cutoff score is offered by Fremer (1972).
Fremer suggested five methods by which a measure can be given
CR meaning, urging that more than one of the methods should be
applied in any particular situation. Most of Fremer's methods
are independent of knowledge of observed scores. Fremer's
suggested methods have not been developed in great depth, and
they are not unequivocal in their results. They are:



1. The use of non-test information to set a minimal
performance standard, such as by determining, a priori, that only
the top 10% will "pass."

2. Teacher judgments of individual test items to estimate
the proportion of a group of "barely passing" students who
would answer the item correctly. The judgments of a number of
knowledgeable raters are then averaged to obtain the minimum
criterion cutoff score for the total instrument. This approach
was followed by the Educational Testing Service to develop
instruments for the statewide Michigan Assessment Program, with
apparently satisfactory results.

3. Teacher judgments regarding which students are performing
at minimum competency levels, either through global judgments
or a detailed analysis of student performance.

4. Development of supplemental work sample tests as
criteria against which to validate the CR measure. This method
closely resembles the traditional predictive validity approach.
Ward (Note 3) used an approach of this sort to "validate" CR
tests used in teacher training.

5. Development of what Fremer calls "stand alone work
sample tests." These are instruments constructed to serve as
direct measures of performance on objectives which are considered
so important that they should be measured directly, even though
efficient indirect measurement might be possible.
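
The averaging step in the second method above can be sketched as
follows. This is only a rough illustration: the text says the raters'
judgments are averaged, and summing the per-item averages to obtain a
number-correct cutoff is an assumption made here, as are the function
name `judged_cutoff` and the example ratings.

```python
from statistics import mean
from typing import Sequence


def judged_cutoff(ratings: Sequence[Sequence[float]]) -> float:
    """Combine raters' item judgments into a number-correct cutoff.

    ratings[r][i] is rater r's estimate of the proportion of "barely
    passing" students who would answer item i correctly.
    """
    n_items = len(ratings[0])
    if any(len(r) != n_items for r in ratings):
        raise ValueError("every rater must judge the same items")
    # Average across raters for each item, then sum across items
    # (the summing step is an assumption, not stated in the text).
    item_means = [mean(r[i] for r in ratings) for i in range(n_items)]
    return sum(item_means)


# Hypothetical judgments from three raters on a four-item test.
ratings = [
    [0.9, 0.6, 0.5, 0.8],
    [0.8, 0.7, 0.4, 0.9],
    [1.0, 0.5, 0.6, 0.7],
]
print(round(judged_cutoff(ratings), 2))
```

With these made-up ratings the per-item averages are 0.9, 0.6, 0.5, and
0.8, so a barely passing student would be expected to answer about 2.8
of the 4 items correctly.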

The preceding approaches mostly represent different ways
to arrive at a refined judgment as to the cutoff score. A [...]
problem [...] proposed [...] and often involve [...].
[...] is that of Millman [...], who uses the binomial model
to derive tables representing an exact solution to the [...]
particular proficiency level. The establishment of a proficiency
level to which the procedure would apply is still left
as basically a subjective process. This approach is appropriate
to the degree to which the items which comprise the test can be
assumed to be a random sample of a defined universe of items.
An unknown degree of error would be introduced by using the
Millman approach for CR tests in which the items cannot be
assumed to represent a defined universe. Millman's tables can
be used to answer either of two types of mathematically parallel
reciprocal questions: (a) for individual assessment, how many
items are needed on a test? and (b) for program evaluation, how
many students should be tested? The tables provide a means to
determine the proportion of misclassifications to be expected
when a test of a given length is administered with a given
passing score. An elaboration of Millman's binomial procedure
has been proposed by Novick and Lewis (1974) in which prior
probabilities are used in a Bayesian model to relate observed
test scores to true level of functioning.
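
The kind of quantity such tables report can be sketched directly from
the binomial model. The code below is a minimal illustration under the
assumption stated in the text (items are a random sample from a defined
universe, so each item is answered correctly with probability equal to
the examinee's true proficiency); it is not a reproduction of Millman's
tables, and the function names are hypothetical.

```python
from math import comb


def prob_pass(n_items: int, passing_score: int, p: float) -> float:
    """P(X >= passing_score) when X ~ Binomial(n_items, p)."""
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(passing_score, n_items + 1)
    )


def misclassification_prob(
    n_items: int, passing_score: int, p: float, mastery_level: float
) -> float:
    """Probability of a wrong decision for an examinee whose true
    proportion-correct is p, given the required mastery_level."""
    passes = prob_pass(n_items, passing_score, p)
    # A true master is misclassified by failing the test;
    # a true nonmaster is misclassified by passing it.
    return 1 - passes if p >= mastery_level else passes


# A 10-item test with a passing score of 8 and a mastery level of 0.80.
print(round(misclassification_prob(10, 8, 0.80, 0.80), 4))  # true master
print(round(misclassification_prob(10, 8, 0.60, 0.80), 4))  # true nonmaster
```

Repeating the calculation for several test lengths shows how the
expected misclassification rate depends on the number of items, which
is the first of the two reciprocal questions mentioned above.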

A second quantitative approach is proposed by Block (1972).
His approach is based on utilizing students' future learning
as a criterion for determining the level of proficiency
(mastery standard) which students must have attained at
intermediate stages of instruction. It is, in effect, an operational
answer to the type of question posed by Gronlund, Millman, and
Nitko: "What level of mastery is needed in order to succeed
at the next level of instruction?" Any of several statistical
techniques may be used for this purpose. That level of interim
performance which yields the greatest estimated future learning
is selected as the mastery standard. Block's system appears
to be sufficient to form a basis for designing
an instructional strategy. However, considerable subjectivity
still remains in his system.

A third statistical approach is suggested by Kriewall
(1972), who suggests using an item sampling model to derive the
exact probability of a false negative
or false positive result. Kriewall's procedure does not
require an assumption of equal item difficulty or an assumption
of a true dichotomy between mastery vs. nonmastery
status (Millman), but test statistics are affected by
test length. The procedure does not determine the mastery
level to be used, but it provides a basis for evaluating the
outcome of any particular cutoff score chosen.

A promising, relatively simple but comprehensive statistical
approach to the cutoff score problem is offered by
Emrick (1971). Emrick's "skill-mastery test model" seeks to
establish mastery (cutoff) scores which are optimized in terms
of relative decision error costs and relative item error
probabilities. It combines item and student information to
produce probability statements regarding skill-mastery status.
Skill mastery is treated as an all-or-none variable. Emrick's
model has considerable statistical elegance. However, some of
the assumptions underlying the model, such as homogeneity,
equal item difficulties, and equal item intercorrelations,
frequently are not tenable in practice (Millman, 1973), and
the consequences of violating these assumptions are not known.
The Emrick model also remains subjective at some key points,
but nevertheless appears to offer two advantages: (a) by carefully
isolating and defining the subjective judgments, it permits
them to be made more accurately; and (b) for a given set
of input values, the formula yields the single optimum cutoff
score.
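
The general idea of optimizing a cutoff against decision error costs
and item error probabilities can be illustrated with a small sketch.
This is not Emrick's published formula. It assumes a simple two-state
model in which a nonmaster answers any item correctly with probability
`alpha` and a master misses any item with probability `beta`, with
independent items, and it searches every possible cutoff for the lowest
expected cost. The prior proportion of masters (`prior_master`) and all
function and parameter names are assumptions introduced here.

```python
from math import comb


def binom_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_cost(n, cutoff, alpha, beta, cost_fp, cost_fn, prior_master=0.5):
    """Expected decision cost when examinees pass with score >= cutoff."""
    p_master, p_nonmaster = 1 - beta, alpha
    # False negative: a master scores below the cutoff.
    false_neg = sum(binom_pmf(n, k, p_master) for k in range(0, cutoff))
    # False positive: a nonmaster scores at or above the cutoff.
    false_pos = sum(binom_pmf(n, k, p_nonmaster) for k in range(cutoff, n + 1))
    return (prior_master * cost_fn * false_neg
            + (1 - prior_master) * cost_fp * false_pos)


def best_cutoff(n, alpha, beta, cost_fp, cost_fn, prior_master=0.5):
    """Cutoff in 0..n+1 with minimum expected cost (n+1 means no one passes)."""
    return min(
        range(n + 2),
        key=lambda c: expected_cost(n, c, alpha, beta, cost_fp, cost_fn,
                                    prior_master),
    )


# Five items, alpha = 0.2, beta = 0.1 (all values hypothetical).
print(best_cutoff(5, 0.2, 0.1, cost_fp=1, cost_fn=1))
print(best_cutoff(5, 0.2, 0.1, cost_fp=10, cost_fn=1))
```

In this sketch, making a false advance ten times as costly as a false
hold raises the selected cutoff, which mirrors the point that the
subjective inputs are isolated and the cutoff then follows from them.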


An approach somewhat similar to Emrick's, but which
attempts to avoid some of these difficulties, is offered by
Besel (1973). Besel's "mastery learning test model" also
differs from Emrick's in another important respect, by using
an independent estimate of the proportion of students in a
comparison group which have achieved an objective (which may
be looked at as normative data) as a check upon the accuracy
of individual prediction. This use of group estimates of prior
probabilities resulted in significantly improved stability of
mastery learning parameters and improved individual predictions.
Although Besel's model overcomes some of the limiting assumptions
of Emrick's approach, it is mathematically complex and it seems
doubtful that Besel's approach offers a worthwhile advantage
over Emrick's simpler model.

A still more complex approach to the cutoff score issue
is the "decision-theoretic" approach (Hambleton & Novick, 1973;
Swaminathan, Hambleton, & Algina, Note 6). This is a Bayesian
procedure which allows the use of prior and collateral
information, and also incorporates the cost of misclassifications.
The procedure deliberately introduces the decision maker's
values into the decision process. The decision-theoretic
approach will accommodate a three-category decision system.
However, like some other approaches which have been discussed,
the decision-theoretic approach does not actually establish
a cutoff score. Rather, it starts with an arbitrarily set
cutoff score and then analyzes the consequences of using that
cutoff.

Among all of these many techniques for establishing cutoff
scores on CR tests which have been discussed, and still others
which have been suggested in the literature, one can be found
to meet almost any situation in which a cutoff score must be
established. Nearly any of the methods will be an improvement
over the too-common practice of arbitrarily setting a cutoff
score. Although little research has been conducted to confirm
the value of the various methods, several appear very promising.
For classroom use by teachers not highly skilled in measurement,
several of the commonsense methods suggested by Millman or
Fremer should be useful, and, for larger applications, Emrick's
"skill-mastery test model" particularly appears to warrant
wider use.